Abstract



A central problem in analyzing networks is partitioning them into modules or communities, clusters 
with a statistically homogeneous pattern of links to each other or to the rest of the network. One of 
the best tools for this is the stochastic block model, which in its basic form imposes a Poisson degree 
distribution on all nodes within a community or block. In contrast, degree-corrected block models 
allow for heterogeneity of degree within blocks. Since these two model classes often lead to very 
different partitions of nodes into communities, we need an automatic way of deciding which model 
is more appropriate to a given graph. We present a principled and scalable algorithm for this model 
selection problem, and apply it to both synthetic and real-world networks. Specifically, we use belief 
propagation to efficiently approximate the log-likelihood of each class of models, summed over all 
community partitions, in the form of the Bethe free energy. We then derive asymptotic results on the 
mean and variance of the log-likelihood ratio we would observe if the null hypothesis were true, i.e. 
if the network were generated according to the non-degree-corrected block model. We find that for 
sparse networks, significant corrections to the classic asymptotic likelihood-ratio theory (underlying
$\chi^2$ hypothesis testing or the AIC) must be taken into account. We test our procedure against two
real-world networks and find excellent agreement with our theory.



1 Introduction 

In many real-world networks, nodes divide naturally into communities: clusters with dense internal ties which are only
weakly connected to the rest of the graph. More generally, they can divide into modules or functional communities,
where nodes in the same group connect to the rest of the network in similar ways. Discovering such communities is
an important part of modeling networks [24], as community structure offers clues to the processes which generated
the graph, on scales ranging from face-to-face social interaction [31] through social-media communications [1] to the
organization of food webs [3, 18]. Since communities often reflect functional groupings in the underlying system,
community membership is also useful for predicting the attributes of nodes, for predicting links between nodes, and
for statistically controlling for unobserved node attributes.

The stochastic block model (SBM) [11, 15, 27, 2] has, deservedly, become one of the most popular generative models
for community detection. It splits nodes into communities or blocks, within which all nodes are stochastically
equivalent [28]. That is, the probability of an edge between any two nodes depends only on which blocks they belong to,
and all edges are independent given the nodes' block memberships. Block models are highly flexible, representing
assortative, disassortative and satellite community structures, as well as combinations thereof, in a single generative
framework [11, 22]. Their asymptotic properties, including phase transitions in the detectability of communities, can
be determined exactly using tools from statistical physics [10, 9].

Despite this flexibility, SBMs impose real restrictions on networks; notably, the degree distribution within each block
is asymptotically Poisson. This makes the SBM implausible for many real-world networks, where the degrees within
each community are highly inhomogeneous. Fitting the SBM to such networks tends to split the high- and low-degree
nodes in the same community into distinct blocks; for instance, dividing both liberal and conservative political blogs
into high-degree "leaders" and low-degree "followers" [1, 16]. To avoid this effect, and allow degree inhomogeneity
within blocks, there is a long history of generative models where the probability of an edge depends on node attributes
$\theta_u$ as well as their group memberships (e.g. [19, 25]). Here we use the variant due to [16], called the degree-corrected
(DC) block model, where the expected number of edges between $u$ and $v$ is proportional to $\theta_u \theta_v$.

We often lack the domain knowledge to choose between the ordinary and the degree-corrected block model, and so are
faced with a classic problem of statistical model selection. The classic frequentist approaches to model selection are
largely based on likelihood ratios [6], and we follow that approach here. Since both SBM and DC models have many
hidden variables, calculating likelihood ratios is itself non-trivial; the likelihood must be summed over all partitions of
nodes into blocks, so (in statistical physics terms) the log-likelihood is a free energy. We approximate this free energy
using belief propagation, giving a highly scalable algorithm that can deal with large sparse networks in nearly linear
time. However, even with the likelihoods in hand, it turns out that the usual $\chi^2$ theory for likelihood ratios relies on
approximations which are invalid in our setting, because of the dependency and sparsity of network data. We derive
the correct asymptotics under certain assumptions, recovering the classic asymptotics in the limit of dense graphs, but
finding that significant corrections are needed in the sparse case. Numerical experiments confirm the validity of our
expressions, and we apply our method to a range of real and synthetic networks.

2 Poisson Stochastic Block Models 

We have an observed graph G with n nodes and m edges. We assume G is undirected, though the directed case is only 
notationally more cumbersome. We want to split the nodes into k communities, taking k to be given a priori. (We will 
address estimating k elsewhere.) To do this, we need to decide whether to use the ordinary or the degree-corrected 
block model. 

Traditionally, stochastic block models are applied to simple graphs, where each entry $A_{uv}$ of the adjacency matrix
follows a Bernoulli distribution. Following e.g. [16], we use a multigraph version of the block model, where the $A_{uv}$
are independent and Poisson-distributed. (For simplicity, we ignore self-loops.) In the sparse network regime we are
most interested in, this Poisson model differs only negligibly from the original Bernoulli model [23], but the former is
easier to analyze.

2.1 The Ordinary Stochastic Block Model 

In both models, each node $u$ has a latent variable $g_u \in \{1, \ldots, k\}$ indicating which of the $k$ blocks it belongs to. The
block assignment is then $g = \{g_u\}$. The $g_u$ are IID draws from a multinomial distribution parameterized by $\gamma$, where
$\gamma_r = P(g_u = r)$ is the prior probability that a given node belongs to block $r$. Thus $g_u \sim \mathrm{Multi}(\gamma)$. After it assigns
nodes to blocks, each model generates the number of edges $A_{uv}$ between each pair of nodes $u$ and $v$ by making an
independent Poisson draw for each pair. In the ordinary stochastic block model, the means of these Poisson draws are
specified by the $k \times k$ block affinity matrix $\omega$, so $A_{uv} \mid g \sim \mathrm{Poi}(\omega_{g_u g_v})$. If we could observe the block assignment $g$
along with $G$, the "complete data" likelihood would be

$$ P(G, g \mid \omega, \gamma) = \prod_u \gamma_{g_u} \prod_{u<v} \frac{\omega_{g_u g_v}^{A_{uv}}}{A_{uv}!} e^{-\omega_{g_u g_v}} = \prod_r \gamma_r^{n_r} \prod_{r,s=1}^{k} \omega_{rs}^{m_{rs}/2} e^{-\frac{1}{2} n_r n_s \omega_{rs}} \prod_{u<v} \frac{1}{A_{uv}!} . \tag{1} $$

Here $n_r$ denotes the number of nodes in block $r$, and $m_{rs}$ denotes the number of edges connecting block $r$ to block $s$,
or twice that number if $r = s$. The last term is constant in the parameters, and is identically 1 for simple graphs, so we
will discard it in what follows. The log-likelihood is then

$$ \log P(G, g \mid \omega, \gamma) = \sum_r n_r \log \gamma_r + \frac{1}{2} \sum_{r,s=1}^{k} \left( m_{rs} \log \omega_{rs} - n_r n_s \omega_{rs} \right) . \tag{2} $$

Maximizing (2) over $\gamma$ and $\omega$ gives

$$ \hat\gamma_r = \frac{n_r}{n} , \qquad \hat\omega_{rs} = \frac{m_{rs}}{n_r n_s} . \tag{3} $$
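As an illustration of (3), the following minimal sketch computes the maximum-likelihood estimates for a fixed block assignment. The function name `sbm_mle`, the zero-based block labels and the edge-list input are assumptions made for this example; they are not part of the model's definition.

```python
def sbm_mle(n, edges, g, k):
    """Sketch of the MLEs in eq. (3) for a *given* block assignment g.

    n     : number of nodes (labelled 0..n-1, an assumption of this sketch)
    edges : list of undirected edges (u, v); repeats encode multi-edges
    g     : list of block labels in 0..k-1
    """
    n_r = [0] * k
    for u in range(n):
        n_r[g[u]] += 1

    # m[r][s] = number of edges between blocks r and s,
    # counted twice when r == s, matching the convention in the text.
    m = [[0] * k for _ in range(k)]
    for u, v in edges:
        r, s = g[u], g[v]
        m[r][s] += 1
        m[s][r] += 1

    gamma = [n_r[r] / n for r in range(k)]
    omega = [[m[r][s] / (n_r[r] * n_r[s]) if n_r[r] and n_r[s] else 0.0
              for s in range(k)] for r in range(k)]
    return gamma, omega
```

For a path on four nodes split into two blocks of two, `sbm_mle(4, [(0, 1), (1, 2), (2, 3)], [0, 0, 1, 1], 2)` returns block weights of 1/2 each and a within-block affinity larger than the between-block one.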

Of course, the block assignments $g$ are not observed, but rather are what we most want to infer. We could try to infer
$g$ by maximizing (2) over $\omega$, $\gamma$ and $g$ jointly; in terms borrowed from statistical physics, this amounts to finding the
ground state $\hat{g}$ that minimizes the energy $-\log P(G, g \mid \omega, \gamma)$. When this $\hat{g}$ can be found, it recovers the correct $g$
exactly if the graph is dense enough [5]. But if we wish to infer the parameters $\gamma$, $\omega$, or to perform model selection,
we are interested in the total probability that the block model generates the network at hand. This is

$$ P(G \mid \omega, \gamma) = \sum_g P(G, g \mid \omega, \gamma) , $$

where the sum is over all $k^n$ possible block assignments. Again following the physics picture, this is the partition
function of the Gibbs distribution of $g$, and its logarithm is (minus) the free energy.
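To make the sum over $k^n$ assignments concrete, here is a brute-force sketch that evaluates $\log P(G \mid \omega, \gamma)$ exactly for a very small graph, dropping the constant $\prod_{u<v} 1/A_{uv}!$ term as in the text. The function name and the dense adjacency-matrix input are assumptions of this example; the exponential cost is exactly why an approximation such as belief propagation is needed for real networks.

```python
import itertools
import math

def log_marginal_likelihood(A, gamma, omega):
    """Exact log P(G | omega, gamma) by enumerating all k^n assignments.

    A is a symmetric n x n matrix of edge counts; only feasible for tiny n.
    All entries of gamma and omega are assumed to be strictly positive.
    """
    n, k = len(A), len(gamma)
    log_terms = []
    for g in itertools.product(range(k), repeat=n):
        lp = sum(math.log(gamma[r]) for r in g)
        for u in range(n):
            for v in range(u + 1, n):
                w = omega[g[u]][g[v]]
                lp += A[u][v] * math.log(w) - w   # Poisson term without 1/A_uv!
        log_terms.append(lp)
    # log-sum-exp for numerical stability
    top = max(log_terms)
    return top + math.log(sum(math.exp(x - top) for x in log_terms))
```

With a single block the sum has one term, so the result reduces to $m \log \omega - \binom{n}{2} \omega$; splitting that block into two with identical affinities leaves the value unchanged, which is a convenient sanity check.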



As is usual with latent variable models, we can infer $\gamma$ and $\omega$ using an EM algorithm [20], where the E step
approximates the average over $g$ with respect to the Gibbs distribution, and the M step estimates $\gamma$ and $\omega$ in order to
maximize that average. One approach to the E step would use a Markov chain Monte Carlo (MCMC) algorithm to
sample $g$ from the Gibbs distribution. However, as we will see below, in order to determine $\gamma$ and $\omega$ it suffices to
estimate the marginal distributions of $g_u$ for each $u$, and the joint marginal distributions of $(g_u, g_v)$ for each pair of nodes
$u, v$ [12, 17, 4]. As we show in Section 3, belief propagation efficiently approximates both the free energy $-\log P(G \mid \omega, \gamma)$
and these marginals, and for many networks it converges very rapidly. Other methods of approximating the E step are
certainly possible, and could be used with our model-selection analysis.

2.2 The Degree-Corrected Block Model 

As discussed above, in the SBM any two nodes in the same block have the same degree distribution. Moreover, 
their degrees are sums of independent Poisson variables, so this distribution is Poisson. As a consequence, the SBM 
"resists" putting nodes with very different degrees in the same block. This leads to problems with real networks where 
the degree distribution is highly skewed. 

The degree-corrected (DC) model extends the SBM to allow for heterogeneity of degree within blocks. Nodes are
assigned to blocks as before, but each node also gets an additional parameter $\theta_u$, which scales the number of edges
connecting it to other nodes. Thus

$$ A_{uv} \mid g \sim \mathrm{Poi}(\theta_u \theta_v \omega_{g_u g_v}) . $$

Varying the $\theta_u$ gives any desired degree sequence, at least in expectation. Since setting $\theta_u = 1$ for all $u$ recovers the
SBM, that model is nested inside the DC model, which is strictly more general.
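The generative process just described can be sketched as follows. This is only an illustration: the helper `poisson` uses Knuth's multiplication method because the Python standard library has no Poisson sampler, and the function names and argument layout are assumptions of this example.

```python
import math
import random

def poisson(lam, rng):
    """Draw from Poisson(lam) by Knuth's method (fine for small means;
    exp(-lam) underflows for very large lam)."""
    limit = math.exp(-lam)
    count, prod = 0, 1.0
    while True:
        prod *= rng.random()
        if prod <= limit:
            return count
        count += 1

def sample_dc_sbm(theta, g, omega, rng):
    """Sample a symmetric multigraph with A_uv ~ Poi(theta_u theta_v omega_{g_u g_v}).

    Self-loops are ignored, as in the text. Setting every theta_u to 1
    gives a draw from the ordinary SBM.
    """
    n = len(g)
    A = [[0] * n for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            mean = theta[u] * theta[v] * omega[g[u]][g[v]]
            A[u][v] = A[v][u] = poisson(mean, rng)
    return A
```

For example, `sample_dc_sbm([1.0] * 6, [0, 0, 0, 1, 1, 1], [[0.8, 0.1], [0.1, 0.8]], random.Random(0))` draws a small assortative two-block graph.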

The likelihood stays the same if we increase $\theta_u$ by some factor $c$ for all nodes in block $r$, provided we also decrease
$\omega_{rs}$ for all $s \ne r$ by the same factor (and $\omega_{rr}$ by $c^2$). Thus identification demands a constraint, and a convenient one forces the $\theta_u$ to sum to
the total degree within each block: $\sum_{u: g_u = r} \theta_u = \sum_{u: g_u = r} d_u$. We denote this total degree $D_r$. The complete-data
likelihood of the DC model is then

$$ P(G, g \mid \theta, \omega, \gamma) = \prod_u \gamma_{g_u} \prod_{u<v} \frac{(\theta_u \theta_v \omega_{g_u g_v})^{A_{uv}}}{A_{uv}!} \exp(-\theta_u \theta_v \omega_{g_u g_v}) $$
$$ = \prod_r \gamma_r^{n_r} \prod_u \theta_u^{d_u} \prod_{rs} \omega_{rs}^{m_{rs}/2} \exp\left(-\tfrac{1}{2} D_r D_s \omega_{rs}\right) \prod_{u<v} \frac{1}{A_{uv}!} , \tag{4} $$

where $n_r$ and $m_{rs}$ are as before. Again ignoring the constant term, the log-likelihood is

$$ \log P(G, g \mid \theta, \omega, \gamma) = \sum_r n_r \log \gamma_r + \sum_u d_u \log \theta_u + \frac{1}{2} \sum_{rs} \left( m_{rs} \log \omega_{rs} - D_r D_s \omega_{rs} \right) . \tag{5} $$

Maximizing (5) yields the MLEs

$$ \hat\theta_u = d_u , \qquad \hat\gamma_r = \frac{n_r}{n} , \qquad \hat\omega_{rs} = \frac{m_{rs}}{D_r D_s} . \tag{6} $$

However, as with the ordinary SBM, we will estimate $\gamma$ and $\omega$ not just for a ground state $\hat{g}$, but using belief propagation
to find the marginal distributions for $g_u$ and pairwise marginals for $(g_u, g_v)$.

3 Belief Propagation and the Bethe Free Energy 

We referred above to the use of belief propagation for computing free energies and marginal distributions of block
assignments. Here we describe how belief propagation works for the degree-corrected block model, extending the
treatment of the SBM in [10, 9]. The key idea [29] is that each node $u$ sends a message to every other node $v$,
indicating the marginal distribution of $g_u$ if $v$ were absent. We write $\mu_r^{u \to v}$ for the probability that $u$ would be of type
$r$ in the absence of $v$. Then $\mu_r^{u \to v}$ gets updated in light of the messages $u$ gets from the other nodes as follows. Let

$$ f(\theta_u, \theta_v, \omega_{rs}, A_{uv}) = \frac{(\theta_u \theta_v \omega_{rs})^{A_{uv}}}{A_{uv}!} \exp(-\theta_u \theta_v \omega_{rs}) \tag{7} $$

denote the probability that $A_{uv}$ takes its observed value assuming that $g_u = r$ and $g_v = s$. Then

$$ \mu_r^{u \to v} = \frac{1}{Z^{u \to v}} \, \gamma_r \prod_{w \ne u, v} \sum_{s=1}^{k} \mu_s^{w \to u} f(\theta_u, \theta_w, \omega_{rs}, A_{uw}) , \tag{8} $$

where $Z^{u \to v}$ is a normalization factor set so that $\sum_r \mu_r^{u \to v} = 1$. As usual in belief propagation, we assume here that
the block assignments $g_w$ of the other nodes are independent conditioned on $g_u$.

Note that each node sends messages to every other node, not just to its neighbors, since non-edges are also informative
about $g_u$ and $g_v$. Thus we have a Markov random field on a weighted complete graph, as opposed to just on the
network itself. However, keeping track of $n^2$ messages is cumbersome. For sparse networks, we can restore scalability
by noticing that, up to $O(1/n)$ terms, each node $u$ sends the same message to all of its non-neighbors. That is, for any
$v$ such that $A_{uv} = 0$, we have $\mu_r^{u \to v} = \mu_r^{u}$ where

$$ \mu_r^{u} = \frac{1}{Z^{u}} \, \gamma_r \prod_{w \ne u} \sum_{s=1}^{k} \mu_s^{w \to u} f(\theta_u, \theta_w, \omega_{rs}, A_{uw}) . \tag{9} $$

This simplification reduces the number of messages to $O(n + m)$. We can then write, again up to $O(1/n)$ terms,

$$ \mu_r^{u \to v} = \frac{1}{Z^{u \to v}} \, \gamma_r \prod_{w \ne v :\, A_{uw} \ne 0} \sum_{s=1}^{k} \mu_s^{w \to u} f(\theta_u, \theta_w, \omega_{rs}, A_{uw}) \; \prod_{w} \sum_{s=1}^{k} \mu_s^{w} f(\theta_u, \theta_w, \omega_{rs}, 0) . $$

Since the second product depends only on $\theta_u$, we can compute it once for each degree in the network, and then update
the messages for each $u$ in $O(k^2 d_u)$ time. Thus, for fixed $k$, the total time it takes to update all the messages is
$O(m + \ell n)$, where $\ell$ is the number of distinct degrees. As discussed in [9], for many networks only a constant number
of updates is necessary in order to reach a fixed point, making the entire algorithm quite scalable.
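The update (8) can be sketched directly on the complete graph. The code below is a deliberately naive illustration that keeps all $n(n-1)$ messages and costs $O(n^3 k^2)$ per sweep; it does not implement the sparse $O(m + \ell n)$ scheme described above. The function names and the nested-list message layout are assumptions of this example, and all parameters are assumed strictly positive.

```python
import math

def edge_factor(theta_u, theta_v, omega_rs, a):
    """Poisson probability of a observed edges, as in eq. (7)."""
    lam = theta_u * theta_v * omega_rs
    return lam ** a * math.exp(-lam) / math.factorial(a)

def bp_sweep(A, theta, gamma, omega, msg):
    """One synchronous update of every message following eq. (8).

    msg[u][v] is a list of k probabilities: the belief about g_u
    when node v is removed. Returns a new message table.
    """
    n, k = len(A), len(gamma)
    new = [[None] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            if u == v:
                continue
            log_weight = [math.log(gamma[r]) for r in range(k)]
            for w in range(n):
                if w == u or w == v:
                    continue
                for r in range(k):
                    incoming = sum(
                        msg[w][u][s] * edge_factor(theta[u], theta[w], omega[r][s], A[u][w])
                        for s in range(k))
                    log_weight[r] += math.log(incoming)
            # normalise in log space so that the message sums to 1
            top = max(log_weight)
            unnorm = [math.exp(x - top) for x in log_weight]
            z = sum(unnorm)
            new[u][v] = [x / z for x in unnorm]
    return new
```

A useful check of the sketch: when all entries of $\omega$ are equal, the graph carries no information about the blocks, and a single sweep returns every message to the prior $\gamma$.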

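The BP estimate of the joint marginals $\Pr[g_u = r, g_v = s]$ is $b_{rs}^{uv} \propto f(\theta_u, \theta_v, \omega_{rs}, A_{uv}) \, \mu_r^{u \to v} \mu_s^{v \to u}$, normalized so
that $\sum_{rs} b_{rs}^{uv} = 1$. The M step of the EM algorithm sets $\gamma$ and $\omega$ analogously to (6),

$$ \gamma_r = \frac{\bar{n}_r}{n} = \frac{\sum_u \mu_r^u}{n} , \qquad \omega_{rs} = \frac{\bar{m}_{rs}}{\bar{D}_r \bar{D}_s} = \frac{\sum_{u \ne v} A_{uv} b_{rs}^{uv}}{\left( \sum_u d_u \mu_r^u \right) \left( \sum_u d_u \mu_s^u \right)} . \tag{10} $$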

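Belief propagation also lets us approximate the partial-data likelihood, i.e., the total probability summed over $g$ that
the model generates $G$. The Bethe free energy is the following approximation to the log partition function [30]:

$$ \log P(G \mid \theta, \omega, \gamma) \approx \sum_u \log Z^u - \sum_{u<v :\, A_{uv} \ne 0} \log \left( \sum_{rs} f(\theta_u, \theta_v, \omega_{rs}, A_{uv}) \, \mu_r^{u \to v} \mu_s^{v \to u} \right) + \frac{1}{2} \sum_{rs} \omega_{rs} D_r D_s . \tag{11} $$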



We reiterate that while we use belief propagation in our numerical work, our results on model selection in the next 
section are quite indifferent as to how the likelihood is maximized, or how the free energy is computed. 






4 Model Selection 



When the degree distribution is relatively homogeneous within each block (e.g. [11, 15]), the ordinary stochastic block
model is better than the degree-corrected model, since the extra parameters $\theta_u$ simply lead to over-fitting. On the other
hand, when degree distributions within blocks are highly heterogeneous, DC is better. The challenge comes when each
model offers a different partition; for instance, when the SBM divides blogs into high- and low-degree groups, and DC
divides them according to political leanings. If we lack prior information about which model is a better account of the
network, we need to use the data to pick a model, i.e., to do model selection [6].

From the machine-learning perspective, the natural impulse is to reach for multi-fold cross-validation. Unfortunately,
because network data is globally dependent, there is as yet no good way to split a given network into training and testing sets for
cross-validation. Predicting missing links or tagging false positives are popular forms of leave-$k$-out cross-validation
in the network literature [7 14], but the latter does not converge on the true model even for IID data 151 . 

Instead, we approach this problem statistically, as one of hypothesis testing. Since the ordinary SBM is nested within
the DC model, any given graph $G$ must be at least as likely under the latter as under the former. Moreover, if the
SBM really is the better model, the DC should converge to it, at least in the limit of large networks. Our null model
$H_0 = \{\gamma, \omega\}$ then is the SBM, and the larger, nesting alternative $H_1 = \{\theta, \gamma, \omega\}$ is the DC model. The appropriate
test statistic is the log-likelihood ratio,

$$ \Lambda(G) = \log \frac{\sup_{H_1} \sum_g \prod_r \gamma_r^{n_r} \prod_u \theta_u^{d_u} \prod_{rs} \omega_{rs}^{m_{rs}/2} \exp\left(-\tfrac{1}{2} D_r D_s \omega_{rs}\right)}{\sup_{H_0} \sum_g \prod_r \gamma_r^{n_r} \prod_{rs} \omega_{rs}^{m_{rs}/2} \exp\left(-\tfrac{1}{2} n_r n_s \omega_{rs}\right)} . \tag{12} $$

We reject the null model in favor of the more elaborate alternative when $\Lambda$ exceeds some threshold. This threshold, in
turn, is fixed by our desired error rate, and by the distribution of $\Lambda$ when $G$ is generated from the null model. When
$G$ is small, the null-model distribution of $\Lambda$ can be found through parametric bootstrapping [8]: fitting $H_0$, generating
new graphs $\tilde{G}$ from it, and evaluating $\Lambda(\tilde{G})$. When $n$ is large, however, it would be helpful to replace bootstrapping
with analytic calculations.
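The bootstrap procedure can be sketched generically as follows. The callable `simulate_null_statistic` is a hypothetical placeholder for "generate one graph from the fitted $H_0$ and return its $\Lambda$"; how that is done (for instance with belief propagation) is outside this sketch, and the plain proportion used for the p-value is one common choice rather than the only one.

```python
def bootstrap_p_value(observed, simulate_null_statistic, num_samples):
    """Parametric-bootstrap p-value for a one-sided test.

    observed                : the statistic computed on the real graph
    simulate_null_statistic : zero-argument callable returning the statistic
                              for one graph drawn from the fitted null model
                              (hypothetical; supplied by the caller)
    num_samples             : number of bootstrap replicates
    """
    draws = [simulate_null_statistic() for _ in range(num_samples)]
    exceed = sum(1 for x in draws if x >= observed)
    return exceed / num_samples, draws
```

The returned list of draws can also be used to plot the empirical null distribution of the statistic.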

A classic result in asymptotic statistics [26] asserts that in hypothesis-testing problems like this, the large-sample null
distribution of $\Lambda$ is $2\Lambda(G) \sim \chi^2_\ell$, where $\ell$ is the number of constraints that must be imposed on $H_1$ to recover $H_0$. In
this case we have $\ell = n - k$, as we must set all $n$ of the $\theta_u$ to 1, while our identifiability convention $\sum_{u: g_u = r} \theta_u = D_r$
already imposed $k$ constraints.

However, deriving the $\chi^2$ distribution relies on a key assumption [26, 13]: namely, that the log-likelihood of both
models is well-approximated by a quadratic function in the vicinity of its maximum, so that the parameter estimates
have Gaussian distributions around the true model. The most common grounds for this assumption are central limit
theorems for IID data, or more generally, being in a "large data limit." We will see that, for sparse networks, this
assumption does not hold for the parameters $\theta_u$. Nevertheless, with some work we are able to compute the mean and
variance of $\Lambda$'s null distribution. While we recover the classical $\chi^2$ distribution in the limit of large, dense graphs,
there are significant corrections when the average degree of the graph is small. These corrections need to be taken into
account in order to solve this model selection problem correctly.

To obtain theoretical estimates of the null distribution of $\Lambda$, we assume that the Gibbs distribution of both models
is concentrated on the same block assignment $\hat{g}$. This is a major assumption, but it is borne out by our experiments
(Figs. 1 and 2), and by the fact that under some conditions [5] the SBM recovers the underlying block assignment exactly.
Under this assumption, while the free energy differs from the ground state energy by an entropy term, the free energy
difference between the two models has the same distribution as the ground state energy difference. The MLEs
for $H_0$ and $H_1$ are then given by (3) and (6) respectively. Substituting these into (12) gives $\Lambda$ the form of a Kullback-Leibler
divergence,

$$ \Lambda(G) = \sum_u d_u \log d_u + \frac{1}{2} \sum_{rs} m_{rs} \log \frac{n_r n_s}{D_r D_s} = \sum_u d_u \log \frac{d_u}{\bar{d}_{g_u}} , \tag{13} $$

where $\bar{d}_{g_u} = D_{g_u} / n_{g_u}$ is the average degree of $u$'s block. Note that $\bar{d}_r$ is the empirical mean, not the expected degree
$\mu_r = \sum_s \gamma_s \omega_{rs}$ of the true underlying SBM.
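Given a block assignment, the final form of (13) needs only the degree sequence. A minimal sketch, assuming the degrees and block labels are supplied as parallel lists (the function name is an assumption of this example):

```python
import math

def log_likelihood_ratio(degrees, g):
    """Lambda = sum_u d_u log(d_u / dbar_{g_u}), the last form in eq. (13).

    degrees : list of node degrees d_u
    g       : list of block labels, one per node
    Nodes of degree zero contribute nothing, since d log d -> 0.
    """
    total = {}   # D_r: total degree of block r
    size = {}    # n_r: number of nodes in block r
    for d, r in zip(degrees, g):
        total[r] = total.get(r, 0) + d
        size[r] = size.get(r, 0) + 1

    lam = 0.0
    for d, r in zip(degrees, g):
        if d > 0:
            lam += d * math.log(d / (total[r] / size[r]))
    return lam
```

If every node in a block has the same degree, the statistic is zero; any spread of degrees within a block makes it positive, which is the sense in which it measures the evidence for degree correction.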

Figure 1: Joint density of posterior probabilities over block assignments, showing that the SBM and the DC are
concentrated around the same ground state. The synthetic network had $n = 10^3$, $k = 2$, $\gamma_1 = \gamma_2 = 1/2$, $\mu_r = 11$,
$\omega_{12}/\omega_{11} = \omega_{21}/\omega_{22} = 1/11$. The x and y axes are the marginal probabilities of being in block 1 according to the
SBM and DC models.

We can understand the asymptotic null distribution of $\Lambda$ by assuming that the $d_u$ in each block $r$ are IID and Poisson
with expectation $\mu_r$. This assumption is sound in the limit $n \to \infty$, since the correlations between node degrees are
$O(1/n)$. In that case, we can compute the expectation and variance of $\Lambda$ analytically (see Appendix A). These results
show how the behavior of $\Lambda$ differs from naive $\chi^2$ asymptotics, as well as revealing the limits where the naive results
apply. Specifically, we have

$$ E[\Lambda] = \sum_r \left[ n_r f(\mu_r) - f(n_r \mu_r) \right] , \tag{14} $$

where, if $d$ is Poisson with mean $\mu$,

$$ f(\mu) = E[d \log d] - \mu \log \mu = \sum_{d=1}^{\infty} \frac{e^{-\mu} \mu^d}{d!} \, d \log d - \mu \log \mu . \tag{15} $$

In the limit $\mu \to \infty$, i.e., for dense graphs, both $f(\mu)$ and $f(n\mu)$ approach 1/2, and (14) gives $E[\Lambda] = (n - k)/2$ just
as in the standard $\chi^2$ analysis. However, when $\mu$ is finite, $f(\mu)$ differs significantly from 1/2.
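The function $f(\mu)$ in (15) is easy to evaluate numerically by truncating the Poisson sum far in the upper tail. The sketch below does this and then assembles $E[\Lambda]$ from (14); the truncation rule and the function names are choices made for this example, not part of the theory.

```python
import math

def poisson_mean_of(h, mu):
    """E[h(d)] for d ~ Poisson(mu), truncating the sum about 12 standard
    deviations above the mean (an ad hoc cutoff chosen for this sketch)."""
    upper = int(mu + 12.0 * math.sqrt(mu) + 30)
    return sum(math.exp(-mu + d * math.log(mu) - math.lgamma(d + 1)) * h(d)
               for d in range(upper + 1))

def xlogx(d):
    """d log d, with the convention 0 log 0 = 0."""
    return d * math.log(d) if d > 0 else 0.0

def f(mu):
    """f(mu) = E[d log d] - mu log mu, eq. (15)."""
    return poisson_mean_of(xlogx, mu) - mu * math.log(mu)

def expected_lambda(block_sizes, block_means):
    """E[Lambda] = sum_r [ n_r f(mu_r) - f(n_r mu_r) ], eq. (14)."""
    return sum(n_r * f(mu_r) - f(n_r * mu_r)
               for n_r, mu_r in zip(block_sizes, block_means))
```

Evaluating `f` at a large mean such as 1000 gives a value very close to 1/2, while at small means it departs visibly from 1/2, in line with the discussion above.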

The variance of $\Lambda$ is more complicated, but still calculable. The limiting variance per node is

$$ \lim_{n \to \infty} \frac{1}{n} \mathrm{Var}[\Lambda] = \sum_r \gamma_r \, v(\mu_r) , \tag{16} $$

where, again taking $d$ to be Poisson with mean $\mu$,

$$ v(\mu) = \mu (1 + \log \mu)^2 + \mathrm{Var}[d \log d] - 2 (1 + \log \mu) \, \mathrm{Cov}[d, d \log d] . \tag{17} $$

Since the variance of $\chi^2_\ell$ is $2\ell$, the $\chi^2$ analysis would predict $(1/n) \mathrm{Var}[\Lambda] = 1/2$. Indeed $v(\mu)$ approaches 1/2 in
the limit $\mu \to \infty$, but like $f(\mu)$ it differs significantly from 1/2 for finite $\mu$. Plots of $f(\mu)$ and $v(\mu)$, and leading
corrections to the classical asymptotics, are given in Appendix A.
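A numerical sketch of $v(\mu)$ from (17), using the same truncated Poisson sums as one would use for $f(\mu)$. The helper names and the truncation point are assumptions of this example.

```python
import math

def poisson_mean_of(h, mu):
    """E[h(d)] for d ~ Poisson(mu), with an ad hoc upper truncation."""
    upper = int(mu + 12.0 * math.sqrt(mu) + 30)
    return sum(math.exp(-mu + d * math.log(mu) - math.lgamma(d + 1)) * h(d)
               for d in range(upper + 1))

def xlogx(d):
    """d log d, with the convention 0 log 0 = 0."""
    return d * math.log(d) if d > 0 else 0.0

def v(mu):
    """Limiting variance per node, eq. (17)."""
    mean_g = poisson_mean_of(xlogx, mu)                       # E[d log d]
    var_g = poisson_mean_of(lambda d: xlogx(d) ** 2, mu) - mean_g ** 2
    cov_dg = poisson_mean_of(lambda d: d * xlogx(d), mu) - mu * mean_g
    slope = 1.0 + math.log(mu)
    return mu * slope ** 2 + var_g - 2.0 * slope * cov_dg
```

Because (17) equals the variance of $d \log d - (1 + \log \mu) d$, the result should be non-negative, and it should approach 1/2 as the mean grows.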

Why exactly does the null distribution of $\Lambda$ differ from the usual $\chi^2$ distribution? The reason is that the parameters $\theta_u$
are not in the large data limit. We have one observation for each node, i.e., its degree $d_u$. If a Poisson distribution has
small mean, its shape differs significantly from a Gaussian, and so does the posterior distribution of the mean based
on a single sample. In particular, $P(\theta \mid d)$ follows a Gamma distribution, if the prior on $\theta$ is uninformative [32]. When
the degrees are large, both the sample distribution and the posterior become Gaussian, and the $\chi^2$ analysis takes over;
but when they are small, the geometry is simply different, causing $f(\mu)$ and $v(\mu)$ to differ from 1/2.

Figure 2: (a) $f(\mu)$ from (15), the expected log-likelihood difference per node, compared to simulation results; (b)
the asymptotic variance of the log-likelihood difference per node, from (17), with simulation results; (c) QQ plots
comparing the distribution of log-likelihood differences from $10^4$ synthetic networks with $\mu = 3$ to a Gaussian with
the theoretical mean and variance, showing that the free energy difference and the ground state energy difference have
similar distributions. All simulations used $n = 10^4$, $k = 2$, $\gamma_1 = \gamma_2 = 1/2$, and $\omega_{12}/\omega_{11} = 0.15$, $\omega_{11}/\omega_{22} = 1$. In (a)
and (b), each point is the average over $10^3$ networks, including 95% bootstrap confidence intervals.

As shown in Fig. 2, experiments on synthetic networks generated from the SBM show that the mean and variance
of $\Lambda$ are very well fit by our theoretical results. We have not attempted to compute higher moments of $\Lambda$. However,
if we assume that the $d_u$ are independent, then the central limit theorem applies, and $\Lambda$ follows a Gaussian distribution
in the limit of large $n$. Quantile plots from the same experiments (Fig. 2(c)) show that a Gaussian with mean and
variance given by (14) and (16) is indeed a good fit. Moreover, the free energy difference and the ground state energy
difference have similar distributions, as implied by our assumption that both Gibbs distributions are concentrated
around the ground state. Interestingly, in Fig. 2(c), the degree is low enough that this concentration must be imperfect,
but our theory still holds remarkably well.
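Under the Gaussian approximation, a one-sided p-value for an observed $\Lambda$ follows from the theoretical mean and variance. A minimal sketch (the function name is an assumption of this example; the mean and variance would come from (14) and (16)):

```python
import math

def gaussian_upper_tail(observed, mean, variance):
    """P(X >= observed) for X ~ Normal(mean, variance)."""
    z = (observed - mean) / math.sqrt(variance)
    return 0.5 * math.erfc(z / math.sqrt(2.0))
```

An observation equal to the null mean gives a p-value of 1/2, and one about 1.96 standard deviations above it gives roughly 0.025.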

5 Results on real world networks 

We have derived the theoretical null distribution of $\Lambda$, and backed up our calculations with simulations. We now apply
our theory to the real world, considering two examples studied in [16].

5.1 Zachary's karate club 

This is a social network consisting of 34 members of a karate club, where undirected edges represent friendships [31].
The club split into two factions, one centered around the instructor and the other around the club president. The
network is thus made up of two assortative blocks, each with a high-degree hub and lower-degree peripheral nodes.

The authors of [16] compared the performance of SBM and DC on this network, and heavily favored DC over SBM
because the former leads to a community structure agreeing with the ground truth. Our test, however, shows that
the evidence is not strong enough to reject the null SBM model with any great confidence. As shown in Fig. 3(a),
the distribution of $\Lambda$ from bootstrap experiments is fit reasonably well by a Gaussian with our predicted mean and
variance. The observed $\Lambda = 20.7$ has a p-value of 0.19 according to the theoretical Gaussian, and 0.15 according to
the bootstrap distribution. Thus a prudent statistician would think twice before embracing the additional $n$ parameters
of DC. Indeed, in a study of active learning, the authors of [18] found that SBM labels most of the nodes correctly if
we fix the block assignment of the instructor and the president to 1 and 2 respectively. This implies that the degree
inhomogeneity is not too extreme, and that only a handful of nodes are responsible for the better performance of DC.

Figure 3: Hypothesis testing of real world networks. (a): Zachary's karate club [31], where $n = 34$. The CCDF
(complementary cumulative distribution) of the log-likelihood ratio $\Lambda$ under the null model is estimated using bootstrapping
(shaded), and is fit reasonably well by the CCDF of a Gaussian (curve) with our theoretically predicted mean and
variance. The observed $\Lambda = 20.7$ (marked with the red line) has p-values of 0.15 and 0.19 according to the bootstrap and
theoretical distributions respectively. (b): A network of political blogs [1] where $n = 1222$. The bootstrap distribution
(shaded) is very well fit by a Gaussian (curve) with our predicted mean and variance. The actual log-likelihood ratio
is so far in the tail (see inset) that its p-value is effectively zero. Thus for the blog network, we can decisively reject
the ordinary block model in favor of the degree-corrected model, while for the karate club, the evidence is less clear.

5.2 Political blogs 

The second example is a network of political blogs in the US assembled by Adamic and Glance [1]. As in [16], we
focus on the giant component, which consists of 1222 blogs and 19087 links between them, as captured on a single
day in 2005. The blogs have known political leanings, and were labeled as either liberal or conservative. The network
is assortative and has a highly right-skewed degree distribution within each block.

In its agreement with ground truth, DC substantially outperforms SBM, as observed in [16]. This time around, our
hypothesis testing procedure completely agrees with their choice of model. As shown in Fig. 3(b), the bootstrap
distribution of $\Lambda$ is very well fit by a Gaussian with our theoretical prediction of the mean and variance. The observed
log-likelihood ratio $\Lambda = 8883$ is 330 standard deviations above the mean. It is essentially impossible to produce such
extreme results through mere fluctuations under the null model. Thus, for this network, introducing $n$ extra parameters
to capture the degree heterogeneity, and rejecting SBM in favor of DC, is fully justified.

6 Conclusion 

We have presented a mathematically principled procedure for determining whether the degree-corrected block model is
justified over the ordinary stochastic block model. We found that for sparse networks, the distribution of log-likelihood
ratios differs significantly from the naive $\chi^2$ analysis, and showed how to compute its mean and variance exactly in
the large-$n$ limit where node degrees are essentially independent and Poisson. We confirmed our calculations with
experiments on synthetic networks, and applied our procedure to two real-world networks: one where the ordinary
block model can be decisively rejected, and another where the evidence is less clear. We hope that similar approaches
will let us choose between competing generative models for network data, and in particular between other variants of
the block model such as those in [32].

Figure 4: The function $f(\mu)$ defined in (19), or equivalently the expected log-likelihood difference divided by $n$. We
compare this with experiment in Fig. 2(a).



A Behavior of $\Lambda$ under the null hypothesis

For simplicity we focus on one group with expected degree $\mu$. Assuming independence between the groups will then
recover the expressions (14) and (16), where the mean and variance of $\Lambda$ are weighted sums over groups. We have

$$ \Lambda = \sum_{i=1}^{n} d_i \log \frac{d_i}{\bar{d}} = \sum_i d_i \log d_i - \left( \sum_i d_i \right) \log \left( \sum_i d_i \right) + \left( \sum_i d_i \right) \log n , \tag{18} $$

where $\bar{d} = (1/n) \sum_i d_i$ is the sample mean. We wish to compute the mean and variance of $\Lambda$ if the data is
generated by the null model.

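If $d$ is Poisson-distributed with mean $\mu$, let $f(\mu)$ denote the difference between the expectation of $d \log d$ and its most
likely value $\mu \log \mu$:

$$ f(\mu) = \left( \sum_{d=1}^{\infty} \frac{e^{-\mu} \mu^d}{d!} \, d \log d \right) - \mu \log \mu . \tag{19} $$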

Assume that the $d_i$ are independent and Poisson with mean $\mu$; this is reasonable in a large sparse graph, since the
correlations between degrees of different nodes are $O(1/n)$. Then $\sum_i d_i$ is Poisson with mean $n\mu$, and (18) gives

$$ E[\Lambda] = n f(\mu) - f(n\mu) . \tag{20} $$

To understand this asymptotically, note that $f(\mu)$ converges to 1/2 when $\mu$ is large. Thus in the limit of large $n$,

$$ E[\Lambda] = n f(\mu) - \frac{1}{2} . $$

When $\mu$ is large, this gives $E[\Lambda] = (n - 1)/2$, just as $\chi^2$ hypothesis testing would suggest. However, as Fig. 4 shows,
$f(\mu)$ deviates significantly from 1/2 for finite $\mu$. We can obtain the leading corrections as a power series in $1/\mu$ by
approximating (19) with the Taylor series of $d \log d$ around $d = \mu$, giving

$$ f(\mu) = \frac{1}{2} + \frac{1}{12\mu} + \frac{1}{12\mu^2} + O(1/\mu^3) . $$
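As a worked check of the first correction, the expansion can be carried out explicitly. Writing $h(d) = d \log d$ (the symbol $h$ is introduced here only for this derivation), we have $h''(\mu) = 1/\mu$, $h'''(\mu) = -1/\mu^2$ and $h''''(\mu) = 2/\mu^3$, and the Poisson central moments are $E[(d-\mu)^2] = \mu$, $E[(d-\mu)^3] = \mu$ and $E[(d-\mu)^4] = 3\mu^2 + \mu$. Keeping terms through fourth order,

$$ f(\mu) \approx \frac{h''(\mu)}{2} \mu + \frac{h'''(\mu)}{6} \mu + \frac{h''''(\mu)}{24} (3\mu^2 + \mu) = \frac{1}{2} - \frac{1}{6\mu} + \frac{1}{4\mu} + \frac{1}{12\mu^2} = \frac{1}{2} + \frac{1}{12\mu} + O(1/\mu^2) . $$

The fifth- and sixth-order terms contribute $-1/(2\mu^2)$ and $+1/(2\mu^2)$ at leading order, using $E[(d-\mu)^5] = 10\mu^2 + \mu$ and $E[(d-\mu)^6] = 15\mu^3 + 25\mu^2 + \mu$, so they cancel at order $1/\mu^2$ and the coefficient $1/12$ of $1/\mu^2$ comes from the fourth-order term alone.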
Computing the variance is harder, but still possible. It will be convenient to define several functions. If $d$ is Poisson
with mean $\mu$, let $\phi(\mu)$ denote the variance of $d \log d$:

$$ \phi(\mu) = \mathrm{Var}[d \log d] = E[(d \log d)^2] - E[d \log d]^2 = \sum_{d=1}^{\infty} \frac{e^{-\mu} \mu^d}{d!} (d \log d)^2 - \left( f(\mu) + \mu \log \mu \right)^2 . \tag{21} $$

We will also use

$$ c(\mu) = \mathrm{Cov}[d, d \log d] = E[d^2 \log d] - \mu E[d \log d] = \sum_{d=1}^{\infty} \frac{e^{-\mu} \mu^d}{d!} \, d^2 \log d - \mu \left( f(\mu) + \mu \log \mu \right) . \tag{22} $$

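Finally, let $\lambda > \mu$, and let $d$ and $u$ be independent and Poisson with mean $\mu$ and $\lambda - \mu$ respectively. Then let

$$ r(\mu, \lambda) = \mathrm{Cov}[d \log d, (d + u) \log(d + u)] $$
$$ = E[(d \log d)((d + u) \log(d + u))] - E[d \log d] \, E[(d + u) \log(d + u)] $$
$$ = \sum_{d,u=0}^{\infty} \frac{e^{-\lambda} \mu^d (\lambda - \mu)^u}{d! \, u!} (d \log d)((d + u) \log(d + u)) - \left( f(\mu) + \mu \log \mu \right) \left( f(\lambda) + \lambda \log \lambda \right) , \tag{23} $$

where we used the fact that $d + u$ is Poisson with mean $\lambda$.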

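Then again assuming that the $d_i$ are independent, we have the following terms and cross-terms for the variance of (18):

$$ \mathrm{Var}\left[ \sum_i d_i \log d_i \right] = n \phi(\mu) $$
$$ \mathrm{Var}\left[ \left( \sum_i d_i \right) \log \left( \sum_i d_i \right) \right] = \phi(n\mu) $$
$$ \mathrm{Var}\left[ \sum_i d_i \right] = n\mu $$
$$ \mathrm{Cov}\left[ \sum_i d_i \log d_i , \left( \sum_i d_i \right) \log \left( \sum_i d_i \right) \right] = n \, r(\mu, n\mu) $$
$$ \mathrm{Cov}\left[ \sum_i d_i \log d_i , \sum_i d_i \right] = n \, c(\mu) $$
$$ \mathrm{Cov}\left[ \left( \sum_i d_i \right) \log \left( \sum_i d_i \right) , \sum_i d_i \right] = c(n\mu) $$

Putting this all together, we have

$$ \mathrm{Var}[\Lambda] = n \phi(\mu) + \phi(n\mu) + n\mu \log^2 n - 2n \, r(\mu, n\mu) + 2 \left( n \, c(\mu) - c(n\mu) \right) \log n . \tag{24} $$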

In the limit of large $\mu$, using Taylor series to expand the summands of (21) and (22) gives the following simplifications:

$$ \phi(\mu) = \mu \log^2 \mu + 2\mu \log \mu + \mu + \frac{1}{2} + O\left( \frac{\log \mu}{\mu} \right) $$
$$ c(\mu) = \mu \log \mu + \mu + O(1/\mu) . $$

Also, when $\lambda \gg \mu$ and $\mu = O(1)$, using $\log(d + u) \approx \log u + d/u$ lets us separate the double sum in (23), giving

$$ r(\mu, \lambda) = E[d^2 \log d] (1 + \log \lambda) + E[d \log d] \, E[u \log u] - E[d \log d] \, E[(d + u) \log(d + u)] + O(1/\lambda) . $$

Figure 5: The asymptotic variance of the log-likelihood difference, divided by $n$, given in (25). We compare this with
experiment in Fig. 2(b).



In particular, setting $\lambda = n\mu$ gives

$$ r(\mu, n\mu) = c(\mu)(1 + \log n\mu) + O(1/n) . $$

Finally, keeping $O(n)$ terms in (24) and defining $v(\mu)$ as in (16) gives

$$ v(\mu) = \lim_{n \to \infty} \frac{1}{n} \mathrm{Var}[\Lambda] = \phi(\mu) + \mu (1 + \log \mu)^2 - 2 c(\mu)(1 + \log \mu) . \tag{25} $$

Using the definitions of $\phi$ and $c$, we can write this more explicitly as (where Var and Cov denote the variance and
covariance in the Poisson distribution with mean $\mu$)

$$ v(\mu) = \mu (1 + \log \mu)^2 + \mathrm{Var}[d \log d] - 2 (1 + \log \mu) \, \mathrm{Cov}[d, d \log d] $$
$$ = \mu (1 + \log \mu)^2 + \sum_{d=1}^{\infty} \frac{e^{-\mu} \mu^d}{d!} (d \log d) \left( d \log d - 2 (1 + \log \mu)(d - \mu) \right) - \left( f(\mu) + \mu \log \mu \right)^2 . \tag{26} $$

We plot this function in Fig. 5. It converges to 1/2 in the limit of large $\mu$, but it is significantly larger for finite $\mu$.